Main elements of the INLA methodology
- Latent Gaussian Models
- Sparse matrices
- Laplace approximations
- …other “technical” details for the specially interested!
Latent Gaussian Models and INLA
Everything in INLA is based on so-called latent Gaussian models
- A few hyperparameters \(\theta\sim\pi(\theta)\) control variances, range and so on
- Given these hyperparameters we have an underlying Gaussian distribution \(\mathbf{u}|\theta\sim\mathcal{N}(\mathbf{0},\mathbf{Q}^{-1}(\theta))\) that we cannot directly observe
- Instead we make indirect observations \(\mathbf{y}|\mathbf{u},\theta\sim\pi(\mathbf{y}|\mathbf{u},\theta)\) of the underlying latent Gaussian field
Models of this kind: \[ \begin{aligned} \mathbf{y}|\mathbf{u},\theta &\sim \prod_i \pi(y_i|\eta_i,\theta)\\ \mathbf{\eta} & = A_1\mathbf{u}_1 + A_2\mathbf{u}_2+\dots + A_k\mathbf{u}_k\\ \mathbf{u},\theta &\sim \mathcal{N}(0,\mathbf{Q}(θ)^{−1})\\ \theta & \sim \pi(\theta) \end{aligned} \]
occurs in many, seemingly unrelated, statistical models.
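As a toy illustration of this three-level hierarchy, here is a minimal simulation sketch in plain Python. All concrete choices (a single scalar precision hyperparameter, an iid latent field, a Poisson observation model with identity \(A\)) are hypothetical and only meant to show the structure \(\theta \to \mathbf{u} \to \mathbf{y}\):

```python
import math
import random

random.seed(1)

# Toy latent Gaussian model (hypothetical choices, not the INLA internals):
# theta : a single precision hyperparameter (fixed here; in a full model theta ~ pi(theta))
# u     : latent Gaussian field given theta (simplest case Q = theta * I)
# y     : Poisson counts with log-link, eta_i = u_i
theta = 2.0
n = 5
u = [random.gauss(0.0, 1.0 / math.sqrt(theta)) for _ in range(n)]  # u | theta
eta = u                                       # linear predictor (A = identity here)
lam = [math.exp(e) for e in eta]              # Poisson rates

def rpois(mu):
    """Draw one Poisson variate by inversion (fine for small mu)."""
    l, k, p = math.exp(-mu), 0, random.random()
    c = l
    while p > c:
        k += 1
        l *= mu / k
        c += l
    return k

y = [rpois(l) for l in lam]                   # y | u, theta
```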
In this talk we will focus on the characteristics of \(\mathbf{u}|\theta\)
For many interesting applications it is necessary to solve large problems.
- By large problems we mean models where the size \(n\) of the latent field \(\mathbf{u}\) is between 100 and 100,000. This includes fixed effects, spatial effects and so on.
- Computations with models of this size are cumbersome!
- The solution in the INLA world is Gaussian Markov random fields (GMRFs)!
A Gaussian distribution \(\mathbf{u}\sim\mathcal{N}(0,\Sigma)\) is controlled by:
- The mean vector \(\mu\), which we have set equal to \(\mathbf{0}\). This is the centre of the distribution.
- The covariance matrix \(\Sigma\), which describes all pairwise covariances \(\Sigma_{ij} = \text{Cov}(u_i, u_j)\).
Formally \[ \pi(\mathbf{u}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}\mathbf{u}^T\Sigma^{-1}\mathbf{u}\right\} \]
| Advantages 😄👍 | Disadvantages 😔👎 |
|---|---|
| mostly theoretical | mostly computational |
| Analytically tractable. | \(\Sigma\) usually is large and dense |
| We have a good understanding of its properties. | Computations scale as \(\mathcal{O}(n^3)\) |
| Basically, it’s easy to do the theory | Not feasible for large problems |
We need to reduce the computational burden!
A matrix \(\mathbf{Q}\) is called sparse if most of its elements are zero.
\[ \mathbf{Q} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0& 2& 0 & 0\\ 0& 0& -1 & 0\\ 0& 1& 0 & 0.4\\ \end{bmatrix} \]
- There exist very efficient numerical algorithms to deal with sparse matrices.
- In order to solve large problems we need to exploit this property to make the computations feasible.
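A minimal sketch (plain Python, hypothetical dictionary-of-keys storage; real software uses compressed formats and sparse Cholesky factorizations) of why sparsity helps: only the non-zero entries need to be stored and touched:

```python
# Store only the non-zero entries of a large tridiagonal matrix
# (the AR(1)-type precision used as a running example; phi is arbitrary).
n = 10_000
phi = 0.7

Q = {}                                   # (i, j) -> value; zeros are simply absent
for i in range(n):
    Q[(i, i)] = 1.0 + (phi**2 if 0 < i < n - 1 else 0.0)
    if i + 1 < n:
        Q[(i, i + 1)] = -phi
        Q[(i + 1, i)] = -phi

dense_entries = n * n                    # 10^8 numbers if stored densely
sparse_entries = len(Q)                  # only about 3n numbers

# A matrix-vector product costs O(#non-zeros), not O(n^2):
x = [1.0] * n
y = [0.0] * n
for (i, j), v in Q.items():
    y[i] += v * x[j]
```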
Which matrix should be sparse in a Gaussian field?
\[ \mathbf{u}\sim\mathcal{N}(0,\Sigma) \]
Force the covariance matrix \(\Sigma\) to be sparse
Force the precision matrix \(\mathbf{Q} = \Sigma^{-1}\) to be sparse
Definition
\[ \begin{aligned} \mathbf{i=1}&: u_1 \sim\mathcal{N}(0, \frac{1}{1-\phi^2})\\ \mathbf{i=2,\dots,T}&: u_i = \phi\ u_{i-1} +\epsilon_i,\ \epsilon_i\sim\mathcal{N}(0,1) \end{aligned} \]
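The recursion in the definition above is straightforward to simulate; a quick Python sketch (the parameter values are arbitrary):

```python
import math
import random

random.seed(42)

phi, T = 0.8, 1000
# u_1 ~ N(0, 1/(1 - phi^2)): the stationary distribution of the chain
u = [random.gauss(0.0, 1.0 / math.sqrt(1 - phi**2))]
# u_i = phi * u_{i-1} + eps_i, eps_i ~ N(0, 1)
for _ in range(T - 1):
    u.append(phi * u[-1] + random.gauss(0.0, 1.0))
```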
Covariance Matrix
\[ \Sigma = \frac{1}{1-\phi^2} \begin{bmatrix} 1 & \phi & \phi^2 & \dots & \phi^{T-1} \\ \phi & 1 & \phi & \dots & \phi^{T-2} \\ \phi^2 & \phi & 1 & \dots & \phi^{T-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \phi^{T-1} & \phi^{T-2} & \phi^{T-3} & \dots & 1 \end{bmatrix} \]
This is a dense matrix.
All elements of the \(\mathbf{u}\) vector are dependent.
Precision Matrix
\[ \mathbf{Q} = \Sigma^{-1} = \begin{bmatrix} 1 & -\phi & 0 & 0 & \dots & 0 \\ -\phi & 1+\phi^2 & -\phi & 0 & \dots & 0 \\ 0 & -\phi & 1+\phi^2 & -\phi & \dots & 0 \\ 0 & 0 & -\phi & 1+\phi^2 & \ddots & \vdots \\ \vdots & \vdots & \vdots & \ddots & \ddots & -\phi \\ 0 & 0 & 0 & \dots & -\phi & 1 \end{bmatrix} \]
This is a tridiagonal matrix; it is sparse.
The tridiagonal form of \(\mathbf{Q}\) can be exploited for quick calculations.
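As a sketch of what "quick calculations" means here: the Thomas algorithm solves a tridiagonal system \(\mathbf{Q}\mathbf{u}=\mathbf{b}\) in \(\mathcal{O}(n)\) operations, versus \(\mathcal{O}(n^3)\) for a dense solver (plain Python, illustrative only):

```python
def thomas_solve(lower, diag, upper, b):
    """Solve a tridiagonal system in O(n).
    lower/upper have length n-1, diag/b have length n."""
    n = len(diag)
    d = diag[:]                          # work on copies
    rhs = b[:]
    for i in range(1, n):                # forward elimination
        w = lower[i - 1] / d[i - 1]
        d[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    x = [0.0] * n
    x[-1] = rhs[-1] / d[-1]
    for i in range(n - 2, -1, -1):       # back substitution
        x[i] = (rhs[i] - upper[i] * x[i + 1]) / d[i]
    return x

# AR(1) precision matrix for arbitrary phi = 0.5, n = 5
phi, n = 0.5, 5
diag = [1.0] + [1.0 + phi**2] * (n - 2) + [1.0]
off = [-phi] * (n - 1)
x = thomas_solve(off, diag, off, [1.0] * n)   # solve Q x = (1, ..., 1)
```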
What is the key property of this example that causes \(\mathbf{Q}\) to be sparse?
The key lies in the full conditionals
\[ u_t|\mathbf{u}_{-t}\sim\mathcal{N}\left(\frac{\phi}{1+\phi^2}(u_{t-1}+u_{t+1}),\ \frac{1}{1+\phi^2}\right) \]
Each timepoint is conditionally dependent only on the two closest timepoints
It is useful to represent the conditional independence structure through a graph
Theorem: \(u_i\perp u_j \mid \mathbf{u}_{-ij}\Longleftrightarrow Q_{ij} = 0\)
This is the key property! 😃
A GMRF is a Gaussian distribution where the non-zero elements of the precision matrix are defined by the graph structure.
In the previous example the precision matrix is tridiagonal since each variable is connected only to its predecessor and successor.
\[ \mathbf{Q} = \Sigma^{-1} = \begin{bmatrix} 1 & -\phi & 0 & 0 & \dots & 0 \\ -\phi & 1+\phi^2 & -\phi & 0 & \dots & 0 \\ 0 & -\phi & 1+\phi^2 & -\phi & \dots & 0 \\ 0 & 0 & -\phi & 1+\phi^2 & \ddots & \vdots \\ \vdots & \vdots & \vdots & \ddots & \ddots & -\phi \\ 0 & 0 & 0 & \dots & -\phi & 1 \end{bmatrix} \]
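A quick numerical sanity check (plain Python, small \(n\)) that the AR(1) covariance \(\Sigma_{ij}=\phi^{|i-j|}/(1-\phi^2)\) and the tridiagonal precision of this example are indeed inverses of each other:

```python
phi, n = 0.6, 6                          # arbitrary values for the check

# Covariance: Sigma_ij = phi^|i-j| / (1 - phi^2)
Sigma = [[phi ** abs(i - j) / (1 - phi**2) for j in range(n)] for i in range(n)]

# Tridiagonal precision: 1 at the corners, 1 + phi^2 inside, -phi off-diagonal
Q = [[0.0] * n for _ in range(n)]
for i in range(n):
    Q[i][i] = 1.0 + (phi**2 if 0 < i < n - 1 else 0.0)
    if i + 1 < n:
        Q[i][i + 1] = Q[i + 1][i] = -phi

# Q @ Sigma should be (numerically) the identity matrix
QS = [[sum(Q[i][k] * Sigma[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
```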
We need to compute determinants of large, sparse matrices. This is needed for likelihood calculations.
We need to compute Cholesky factors (matrix square roots) of large, sparse matrices. This is needed for likelihood calculations and simulations.
Technically: sparse Cholesky factorizations exploit the graph structure, so the cost is typically \(\mathcal{O}(n)\) for temporal models, \(\mathcal{O}(n^{3/2})\) for spatial models and \(\mathcal{O}(n^{2})\) for spatio-temporal models.
Compare this with \(\mathcal{O}(n^{3})\) operations in the general case.
Gaussian Markov Random Fields (GMRF):
Definition
A random variable \(\mathbf{u}\) is said to be a Gaussian Markov random field (GMRF) with respect to the graph \(\mathcal{G}\), with vertices \(\{1, 2,\dots , n\}\) and edges \(\mathcal{E}\), with mean \(\mu\) and precision matrix \(\mathbf{Q}\) if its probability distribution is given by \[ \pi(\mathbf{u}) = \frac{|\mathbf{Q}|^{1/2}}{(2\pi)^{n/2}}\exp\left\{ -\frac{1}{2}(\mathbf{u}-\mu)^T\mathbf{Q}(\mathbf{u}-\mu)\right\} \] and \(Q_{ij} \neq 0\Longleftrightarrow \{i,j\}\in\mathcal{E}\)
The latent field \(\mathbf{u}\) is a GMRF \[ \pi(\mathbf{u}|\theta)\propto\exp\left\{-\frac{1}{2}\mathbf{u}^T\mathbf{Q}\mathbf{u}\right\} \]
We can factorize the likelihood as: \(\pi(\mathbf{y}|\mathbf{u},\theta) = \prod_i\pi(y_i|\eta_i,\theta)\)
The vector of hyperparameters \(\theta\) is low dimensional.
Main Inferential Goal:
\[\begin{aligned} \overbrace{\pi(\mathbf{u}, {\theta}\mid\mathbf{y})}^{{\text{Posterior}}} &\propto \overbrace{\pi({\theta}) \pi(\mathbf{u}\mid{\theta})}^{{\text{Prior}}} \overbrace{\prod_{i}\pi(y_i \mid \eta_i, {\theta})}^{{\text{Likelihood}}} \end{aligned}\]

Assumption
We are mainly interested in the posterior marginals \(\pi(u_i|\mathbf{y})\) and \(\pi(\theta_j|\mathbf{y})\).
Strategy
We want to compute \(\pi(u_i|\mathbf{y})\)
Hard way 👎
\[ \pi(u_i|\mathbf{y}) = \int\pi(\mathbf{u}|\mathbf{y})\ d\mathbf{u}_{-i} \]
Easier way 👍
\[ \pi(u_i|\mathbf{y}) = \int\pi(u_i|\theta, \mathbf{y})\,\pi(\theta|\mathbf{y})\ d\theta \]
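In practice the integral over \(\theta\) is replaced by a finite sum over a small grid of \(\theta\)-values, so the marginal becomes a mixture \(\pi(u_i|\mathbf{y}) \approx \sum_k \pi(u_i|\theta_k,\mathbf{y})\,\pi(\theta_k|\mathbf{y})\,\Delta_k\). A toy Python sketch of that mixture structure, with made-up grid weights and Gaussian conditionals (all numbers hypothetical):

```python
import math

def dnorm(x, mean, sd):
    """Gaussian density."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical ingredients, just to illustrate the structure:
theta_grid = [0.5, 1.0, 1.5]                  # grid of hyperparameter values
weights = [0.2, 0.5, 0.3]                     # pi(theta_k | y) * Delta_k, normalised
cond = [(0.0, 1.0), (0.1, 0.8), (0.2, 0.7)]   # (mean, sd) of pi(u_i | theta_k, y)

def marginal(u):
    """pi(u_i | y) as a finite mixture over the theta grid."""
    return sum(w * dnorm(u, m, s) for w, (m, s) in zip(weights, cond))

# the result is a proper density: it integrates (numerically) to 1
grid = [-10 + 0.01 * j for j in range(2001)]
area = sum(marginal(x) * 0.01 for x in grid)
```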
We want to build an approximation to \(\pi(\theta|\mathbf{y})\)
\[ \pi(\mathbf{\theta} \mid \mathbf{y}) = \frac{ \pi(\mathbf{u}, \mathbf{ \theta }\mid \mathbf{y})} {\pi(\mathbf{u} \mid \mathbf{y}, \mathbf{\theta})} \: \Bigg|_{\text{This is just the Bayes theorem! It is valid for any value of } \mathbf{u}} \]
\[\begin{aligned} \pi(\mathbf{\theta} \mid \mathbf{y}) & = \frac{ \pi(\mathbf{u}, \mathbf{ \theta }\mid \mathbf{y})} {\pi(\mathbf{u} \mid \mathbf{y}, \mathbf{\theta})} \\ & \propto \frac{ \overbrace{\pi( \mathbf{y}\mid \mathbf{u},\mathbf{\theta})\ \pi(\mathbf{\theta})}^{\text{Non-Gaussian, KNOWN}}\ \overbrace{\pi( \mathbf{u}\mid \mathbf{\theta})}^{\text{Gaussian, KNOWN}} } {\underbrace{\pi(\mathbf{u} \mid \mathbf{y}, \mathbf{\theta})}_{\text{Non-Gaussian, UNKNOWN}}} \end{aligned}\]
\[ \begin{aligned} \pi(\mathbf{u}\mid {\theta},\mathbf{y}) & \propto \pi(\mathbf{u}\mid {\theta})\ \pi(\mathbf{y}|\mathbf{u},\theta) \\ & \propto \exp\left\{-\frac{1}{2}\mathbf{u}^T\mathbf{Q}\mathbf{u} + \sum_i \log \pi(y_i|\eta_i,\theta)\right\}\\ & \approx \exp\left\{-\frac{1}{2}\mathbf{u}^T\mathbf{Q}\mathbf{u} + \sum_i \left(b_i\eta_i - \frac{1}{2}c_i\eta_i^2\right)\right\} \end{aligned} \]
Recall the second-order Taylor expansion \[ \begin{aligned} f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2} f''(x_0)(x-x_0)^2 = a + bx - \frac{1}{2}cx^2 \end{aligned} \] with
\[ \begin{aligned} b &= f'(x_0) - f''(x_0)\,x_0\\ c &= -f''(x_0) \end{aligned} \]
\[ \begin{aligned} \pi(\mathbf{u}\mid {\theta},\mathbf{y}) & \propto \pi(\mathbf{u}\mid {\theta}) \pi(\mathbf{y}|\mathbf{u},\theta) \\ & \propto \exp\left\{-\frac{1}{2}\mathbf{u}^T\mathbf{Q}\mathbf{u} + \sum \log \pi(y_i|\eta_i,\theta)\right\}\\ & = \exp \left\{ -\frac{1}{2} \mathbf{u}^T\tilde{\mathbf{Q}} \mathbf{u} + \tilde{\mathbf{b}}\mathbf{u}\right\} \end{aligned} \] Which means that we approximate \(\pi(\mathbf{u}\mid {\theta},\mathbf{y})\) with
\[ \mathcal{N}(\tilde{\mathbf{Q}}^{-1}\tilde{\mathbf{b}},\tilde{\mathbf{Q}}^{-1})\ \text{ with }\ \tilde{\mathbf{Q}} = \mathbf{Q} +\begin{bmatrix} \text{diag}(\mathbf{c}) & 0\\ 0& 0 \end{bmatrix}\qquad \tilde{\mathbf{b}} = [\mathbf{b}\ 0]^T \]
Assume \[ \begin{aligned} y|\lambda \sim\text{Poisson}(\lambda) & \quad\text{Likelihood}\\ \eta = \log(\lambda) & \quad\text{Predictor}\\ u = \eta & \quad\text{Latent field}\\ u\sim\mathcal{N}(0,1) & \quad\text{Latent model} \end{aligned} \] we have that
\[ \begin{aligned} \pi(u|y) & \propto\pi(y|u)\,\pi(u)\\ & \propto\exp\left\{ -\frac{1}{2}u^2 + \underbrace{uy-\exp(u)}_{\text{non-Gaussian part}} \right\} \end{aligned} \]
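For this toy model (writing \(u\) for the scalar latent variable, and picking an arbitrary observed count \(y=3\)) the Gaussian approximation can be computed with a few Newton steps towards the mode, matching \(b\) and \(c\) to the first two derivatives at the mode:

```python
import math

# log pi(u | y) = -u^2/2 + u*y - exp(u) + const, for the toy Poisson model
y = 3                                       # arbitrary observed count

def grad(u):
    return -u + y - math.exp(u)             # d/du log pi(u | y)

def hess(u):
    return -1.0 - math.exp(u)               # second derivative, always < 0

u = 0.0
for _ in range(50):                         # Newton iterations to the mode
    u -= grad(u) / hess(u)

mean = u                                    # mode u* of log pi(u | y)
prec = -hess(u)                             # precision 1 + exp(u*) of the Gaussian fit
```

The fitted \(\mathcal{N}(\text{mean}, 1/\text{prec})\) is the Gaussian approximation \(\pi_G(u|y)\) used in the Laplace step below.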
\[ \begin{aligned} \pi(\mathbf{\theta} \mid \mathbf{y}) & \propto \frac{ \pi( \mathbf{y}\mid \mathbf{u},\mathbf{\theta})\ \pi(\mathbf{\theta}) \pi( \mathbf{u}\mid \mathbf{\theta}) } {\pi(\mathbf{u} \mid \mathbf{y}, \mathbf{\theta})}\\ & \approx \frac{ \pi( \mathbf{y}\mid \mathbf{u},\mathbf{\theta})\ \pi(\mathbf{\theta}) \pi( \mathbf{u}\mid \mathbf{\theta}) } {\pi_G(\mathbf{u} \mid \mathbf{y}, \mathbf{\theta})} \: \Bigg|_{\mathbf{u} = \mathbf{u}^\star(\mathbf{\theta})} \end{aligned} \]
If \(\theta\) is low-dimensional we can actually compute this!
This is the Laplace approximation to \(\pi(\theta|\mathbf{y})\) !
Now we have an approximation for \(\pi(\theta|\mathbf{y})\)
We need an approximation for \(\pi(u_i|\theta,\mathbf{y})\)
This is more difficult… \(\mathbf{u}\) is large, and whatever we do has to be done many times!
Can we use the marginal from \(\pi_G(\mathbf{u}|\theta, \mathbf{y})\) ?
From Rue and Martino 2007
Note:
Although INLA mainly computes posterior marginals,
it also allows sampling from the approximate posterior \(\pi(\mathbf{u}|\mathbf{y})\).
This is important if one is interested in the posterior of functions of one or more parameters.
But ultimately…
You do not need to deeply understand all of this if you want to use inlabru to solve your (statistical) problems 😄
Recall: we have a computationally efficient and flexible methodology…
Using the INLA + SPDE approach we can:
The INLA methodology is implemented in two R packages: R-INLA and inlabru.
Why do we (strongly) recommend inlabru?
- It is a wrapper around R-INLA with extra functionality.
- It makes it much easier to fit spatial models.
- It has full support for the sf and terra libraries.
- It makes fitting tricky models (e.g. multi-likelihood models, spatial point processes) less fiddly than R-INLA.
- It takes the observation process into account (e.g. for distance sampling, plot sampling); more on this later.
- It allows for non-linear predictors defined by general R expressions (this is done by linearizing and running the INLA approximation several times…).